Install these two packages first (uncomment the lines below to install them), then load them:
#install.packages("RSelenium")
#install.packages("rvest")
library(RSelenium)
library(rvest)
Static web pages display fixed content and deliver the same page to every user.
Dynamic web pages can display different content to different users; their information changes frequently, and some also support user interaction.
If you try to scrape a dynamic web page, you may not obtain the dynamically generated content with rvest::read_html() alone.
The RSelenium package is a tool for navigating websites from R: it drives a real browser, and it can be combined with the rvest package to scrape dynamic web pages.
Check which version your browser is running. The steps depend on your browser; I am using Chrome here. Click the three dots at the top right -> Help -> About Google Chrome to see which version you are on.
Mine indicates that Chrome is running Version 91.0.4472.77 (Version 91).
Next, download the Selenium web driver that matches your Chrome version.
Then select the chromedriver built for your operating system (I am using a Mac here).
Download the latest stable version of Selenium Server and save it to an appropriate folder.
Open the terminal, change the directory to where you saved the Selenium Server .jar, and then enter java -Dwebdriver.chrome.driver=chromedriver -jar selenium-server-standalone-3.141.59.jar. Here selenium-server-standalone-3.141.59.jar is the name of the file you just downloaded.
Once you see “Selenium Server is up and running on port 4444”, Selenium Server has been successfully installed and is ready to use.
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4444, browserName ="chrome")
remDr$open()
If everything above worked correctly, a new Chrome window will open, controlled by Selenium.
Then you can read the dynamic website using the following code:
remDr$navigate("https://www.columbia.edu/")
page <- rvest::read_html(remDr$getPageSource()[[1]])
Always check whether you have permission to access the page(s):
#install.packages("robotstxt")
robotstxt::paths_allowed("https://www.columbia.edu/")
## www.columbia.edu
## [1] TRUE
If “TRUE” is returned, we can continue the process; otherwise you should search for guidance or reach out for help.
HTML is organized using tags, which are surrounded by <> symbols. A basic html page structure should look like this:
<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>
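As a minimal sketch of how rvest navigates this structure, we can parse the snippet above directly from a string (read_html() accepts a literal HTML string as well as a URL) and pull out the text inside a tag:

```r
library(rvest)

# The example HTML page from above, stored as a string
html <- '<!DOCTYPE html>
<html>
<head><title>Page Title</title></head>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>'

page <- read_html(html)

# Select the <h1> tag and retrieve its text
page %>% html_nodes("h1") %>% html_text()
## [1] "My First Heading"

# The same pattern works for the paragraph
page %>% html_nodes("p") %>% html_text()
## [1] "My first paragraph."
```

Here "h1" and "p" are CSS selectors; the next section shows the same idea using a full XPath copied from Chrome.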
Use the html_nodes() and html_text() functions respectively to search for a tag and retrieve its text. You will need to pass the path to html_nodes(). To get the path, right-click the particular element on the web page in Chrome and choose “Inspect”, then right-click the chosen line in the inspector and click Copy -> Copy full XPath. This gives us the path of the element we chose from the web page.
For example, if we want to extract the word “Commencement” from the Columbia website:
library(tidyverse)
page %>%
html_nodes(xpath = "/html/body/div[1]/div[3]/main/div[2]/section/div/div/div/article/div/div[2]/div[3]/div/div/div/div[4]/div/a/div/div/div/span") %>%
html_text()
## [1] "Commencement"
[Note] There are many resources for parsing PDF documents. I will introduce the pdftools package and the tabulizer package in this tutorial.
#install.packages("pdftools")
#install.packages("tabulizer")
library(pdftools)
## Using poppler version 0.73.0
library(tabulizer)
download.file("https://www.cdc.gov/nchs/data/vsrr/vsrr012-508.pdf","document.pdf")
document <- pdf_text("document.pdf")
The read_lines() function reads the text line by line. str_trim() removes whitespace from the start and end of a string. strsplit() separates the fields from each other; for this we need a regular expression, which you can test online. plyr::ldply() transforms the result from a list into a data frame.
document <- document %>%
.[6] %>%
read_lines() %>%
.[6:17] %>%
str_trim() %>%
strsplit(split = "\\s{2,}") %>%
plyr::ldply()
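To see what the cleaning pipeline does, here is a small sketch applied to a single made-up line laid out like a row of the PDF table (the values are hypothetical; columns are separated by runs of two or more spaces, which is why the regular expression "\\s{2,}" splits them correctly even when a field like "Under 20" contains a single space):

```r
library(tidyverse)  # provides %>% and str_trim(); plyr is assumed installed

# A hypothetical table row as it might come out of pdf_text()
line <- "  Under 20     120,000     15.3     125,000     16.1  "

line %>%
  str_trim() %>%                    # drop leading/trailing whitespace
  strsplit(split = "\\s{2,}") %>%   # split on 2+ consecutive spaces
  plyr::ldply()                     # list -> one-row data frame
##         V1      V2   V3      V4   V5
## 1 Under 20 120,000 15.3 125,000 16.1
```

Note that splitting on a single space would incorrectly break "Under 20" into two fields; requiring at least two spaces keeps multi-word labels intact.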
document[1,][2:5] <- paste0(c(rep("2020 ",2), rep("2019 ",2)), document[1,][2:5])
colnames(document) <- document[1,]
document <- document[-1,]
document
table <- tabulizer::extract_tables(file = "https://www.cdc.gov/nchs/data/vsrr/vsrr012-508.pdf", pages = 6, output = "data.frame")[[1]]
table
table <- table[,c(1,3,5,6,8)]
colnames(table) <- paste0(c("Age of mother", rep("2020 ", 2), rep("2019 ",2)),table[1,])
table <- table[-1,]
table